【Day 09】Hands-On Image & Speech Generation with Hugging Face

2024 iThome Ironman Contest

DAY 9

Generative AI

Series: T 大使 AI 之旅, part 9

16th Ironman Contest

我的狗狗叫饅頭

2024-08-13 23:42:26

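Recap

The previous post covered how to use Pipeline, Tokenizer, and Model, and wrapped up by combining them in a hands-on example. With text out of the way, today I'll try generating images, video, and speech.

A bit of rambling

I originally had no plans to cover speech generation because I had never worked with it. But my manager asked me to write up an investment strategy and then explain the strategy and its code in an audio recording. I'm really no good at voice acting, and since I had to produce an audio file anyway, I turned to AI for help. I ended up playing with speech generation, so I figured I'd share that too.

Image generation

A few days ago I downloaded Stable Diffusion directly from GitHub; today I'm using a pretrained Stable Diffusion model from Hugging Face instead. This time I picked runwayml/stable-diffusion-v1-5.

# Remember to install the diffusers package first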
import torch
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("mps")

prompt = "a photo of a schnauzer"
image = pipe(prompt).images[0]
image.save("photo.png")

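Discussing the code and results 🧐:

The code is very intuitive: import the packages -> load the pretrained model with the Pipeline and move it to the GPU -> pass the prompt to the Pipeline -> save the image. That's all it takes to generate an image that matches the prompt!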
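
The "move it to the GPU" step above hard-codes "mps", which only applies to Apple Silicon machines. Here is a minimal sketch of picking the device at runtime instead. It assumes PyTorch may or may not be installed, and `choose_device` and `detect_device` are hypothetical helper names of my own, not part of diffusers:

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string: CUDA first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are available; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps may be absent on older PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return choose_device(torch.cuda.is_available(), mps_ok)


print(detect_device())
```

The returned string could then be passed to `pipe.to(...)` in place of the literal "mps".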

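Video generation

Video generation is also a pretty cool area. I hadn't planned on covering it either, but since Hugging Face has it, here it is. I couldn't get it to run on my own machine, so I still had to rely on CUDA on Colab.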

import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, EulerDiscreteScheduler
from diffusers.utils import export_to_gif
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

device = "cuda"
dtype = torch.float16

step = 4 # Options: [1,2,4,8]
repo = "ByteDance/AnimateDiff-Lightning"
ckpt = f"animatediff_lightning_{step}step_diffusers.safetensors"
base = "emilianJR/epiCRealism" # Choose to your favorite base model.

adapter = MotionAdapter().to(device, dtype)
adapter.load_state_dict(load_file(hf_hub_download(repo ,ckpt), device=device))
pipe = AnimateDiffPipeline.from_pretrained(base, motion_adapter=adapter, torch_dtype=dtype).to(device)
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing", beta_schedule="linear")

output = pipe(prompt="A schnauzer running on the grass", guidance_scale=1.0, num_inference_steps=step)
export_to_gif(output.frames[0], "animation.gif")

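Discussing the code and results 🧐:

Since it wouldn't run locally, I didn't dig into the code; as long as the result comes out, that's good enough, haha. I'll share more about this video generation technique later if I get the chance. The output is a GIF larger than 1 MB, which I can't upload, so a screenshot stands in for it.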
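
One detail in the code above is easy to trip over: the checkpoint filename is built from `step`, and the comment only lists 1, 2, 4, and 8 as options. A small sketch that checks the value before building the filename might look like this. `lightning_checkpoint` is a hypothetical helper, and the set of valid steps is taken from that comment rather than verified against the repository:

```python
VALID_STEPS = (1, 2, 4, 8)  # step counts listed in the comment in the code above


def lightning_checkpoint(step: int) -> str:
    """Build the AnimateDiff-Lightning checkpoint filename for a given step count."""
    if step not in VALID_STEPS:
        raise ValueError(f"step must be one of {VALID_STEPS}, got {step}")
    return f"animatediff_lightning_{step}step_diffusers.safetensors"


print(lightning_checkpoint(4))
```

Failing early this way avoids a download error for a file that does not exist. The same `step` value is also what the code above passes as `num_inference_steps`.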

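Speech generation

The speech generation models on Hugging Face are currently mostly English, and I found few usable Chinese ones, so the examples here use English. I picked two models, microsoft/speecht5_tts and parler-tts/parler-tts-mini-v1, and I'll explain later why I chose two.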

microsoft/speecht5_tts

# Remember to install the soundfile and datasets packages
# Import packages
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import pipeline

# Set up the Pipeline task
synthesiser = pipeline(task="text-to-speech", model="microsoft/speecht5_tts", device="mps")

# Load a speaker embedding (x-vector) from datasets
embeddings_dataset = load_dataset(path="Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Pass in the text and the speaker embedding prepared above
speech = synthesiser("Hello, my dog is cute.", forward_params={"speaker_embeddings": speaker_embedding})

# Export the result
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])

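Discussing the code and results 🧐:

I don't know how to share an audio file in Markdown 🤣

Install the soundfile and datasets packages and import everything the code needs.
Set up the Pipeline task, similar to yesterday's text-generation.
Load a speaker voice from the Hugging Face dataset Matthijs/cmu-arctic-xvectors. The split has 7,931 entries in total. As far as I can tell, each entry is an x-vector computed from one recorded utterance, so they come from a small set of speakers with different English accents rather than from 7,931 distinct people.
Pass in the text to convert to speech and specify the speaker.
Write the output to a WAV file.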
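
To make the last step more concrete, here is a rough sketch of what "write the output to a WAV file" involves, using only Python's standard library instead of soundfile. It is an illustration, not what soundfile does internally: `write_wav` is a hypothetical helper, the 16-bit PCM format and the 16 kHz rate are assumptions for the demo, and a generated sine tone stands in for `speech["audio"]`:

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # keep values in the valid range
        frames += struct.pack("<h", int(round(clipped * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for speech["audio"]: one second of a 440 Hz tone.
sampling_rate = 16000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / sampling_rate) for n in range(sampling_rate)]
out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_wav(out_path, tone, sampling_rate)
```

The point is that the audio is just an array of numbers plus a sampling rate; the file writer needs both, which is why the code above passes `speech["sampling_rate"]` along with `speech["audio"]`.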

parler-tts/parler-tts-mini-v1

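One thing worth sharing: the parler-tts/parler-tts-mini-v1 page says to install its library (see the screenshot below), and doing so caused a version conflict with transformers. This is where Poetry proved its worth; without it, I don't know how long this bug would have taken to resolve. It echoes what I covered in 【Day 07】 Preparing Before Hands-On Coding!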
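
When a conflict like this shows up, a first step is simply seeing which versions are installed in the current environment. A minimal sketch using the standard library follows. `installed_versions` is a hypothetical helper, and the distribution names in the list (in particular "parler-tts") are my assumption about how the packages are registered:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["transformers", "parler-tts", "torch"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside and outside the Poetry environment makes it easy to compare what each one actually resolved.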

# Install git+https://github.com/huggingface/parler-tts.git first
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

# Use Apple's MPS backend when it is available, otherwise fall back to the CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

prompt = "Hey, The America men's basketball team won the gold medal at the 2024 Paris Olympics."
# Two descriptions for trying different speakers; the second assignment overrides the first,
# so comment out whichever one you don't want to use.
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
description = "Gary's voice sounds excited and happy, in a very close recording with almost no background noise."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

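Discussing the code and results 🧐:

After working through what the Tokenizer and Model do yesterday, this code becomes easy to follow. The prompt is the text to convert to speech, and the description specifies the speaker. Both are converted to token IDs so the model understands what I want it to generate. Finally, the model's output is converted to an audio array and written to a file.
I included two descriptions to try out different speakers; the official page says there are 34 speakers to choose from. You can also set the speaker's mood and tone, which produces different speech styles.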
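
Since the description is plain text, swapping speakers and moods comes down to string building. Here is a small sketch of that idea. `build_description` and its parameters are hypothetical, the wording is loosely modelled on the two descriptions above, and "Gary" is the only speaker name taken from this post:

```python
def build_description(speaker="A female speaker", mood="slightly expressive",
                      pace="moderate", close_up=True):
    """Assemble a free-text voice description from a few style choices."""
    parts = [f"{speaker} delivers a {mood} speech at a {pace} speed."]
    if close_up:
        parts.append("The recording is of very high quality, "
                     "with the voice sounding clear and very close up.")
    else:
        parts.append("The recording has some background noise.")
    return " ".join(parts)


descriptions = [
    build_description(),
    build_description(speaker="Gary", mood="excited and happy"),
]
for text in descriptions:
    print(text)
```

Each string could be tokenized in place of `description` in the code above, which makes it easy to loop over several voices and compare the resulting audio files.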

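Conclusion

Image & video generation: with Stable Diffusion, as long as the hardware allows it, it's easy to get AI to generate what we want.
Speech generation: I picked two models here. One selects the speaker by loading an embedding from a dataset; the other lets you choose the speaker and style with just a short text description. I personally prefer the latter: it generates livelier, more conversational speech, and its code isn't any harder to understand.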